DAT405 Introduction to Data Science and AI

Assignment 1: Introduction to Data Science and Python

| Student name | Hours spent on the tasks |
|---|---|
| Lenia Malki | 10 |
| Maële Belmont | 10 |

Comment on the code

If the plots are not displayed when you open the notebook and the pdf, please either

Sorry for the inconvenience.

Setup

Python modules need to be loaded to solve the tasks.

Task 1 - Download some data related to GDP per capita and life expectancy

A. Write a Python program that draws a scatter plot of GDP per capita vs life expectancy. State any assumptions and motivate decisions that you make when selecting data to be plotted, and in combining data. [1p]

When extracting the data, we chose not to include population size as a parameter. Since we are primarily interested in the relationship between life expectancy and GDP per capita, we did not consider population size to contribute to this relationship. Excluding it makes it easier to focus on the other two variables.

In terms of countries, we chose to remove data points with null values in the columns of interest ('Life expectancy' and 'GDP per capita'). Among the available data, we chose to focus only on the data from 2018. The reason is the assumption that recent data is, in this context, more accurate, since more of it is available today than there was in, for example, the 1900s.

In order to visualize the data in a readable manner, we chose to group countries by continent and color code them. The continent of each entity is only defined for the year 2015, so we found a way (described in the code) to replace the NaN values in the 'Continent' column. To minimize clutter, we used Plotly to create an interactive plot, which displays more information (country and exact values of life expectancy and GDP per capita) when the cursor hovers over a dot. GDP per capita (x-axis) was plotted on a log scale to avoid crowding on the left side of the plot.

It is possible to study any year between 1543 and 2018. In other words, the program is not limited to one specific year, although only one year is plotted at a time.
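The selection steps described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the column names `Entity` and `Year` and the helper names `prepare_year` and `make_scatter` are hypothetical, and filling the continent from the rows where it is defined is only one possible way to replace the NaN values.

```python
import pandas as pd


def prepare_year(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Return the rows for one year with complete values and a continent label."""
    # 'Continent' is only filled in for some rows (2015 in the text), so build an
    # entity -> continent lookup from the rows where it is present.
    # 'Entity' and 'Year' are assumed column names.
    continents = (
        df.dropna(subset=["Continent"])
        .drop_duplicates(subset="Entity")
        .set_index("Entity")["Continent"]
    )
    out = df[df["Year"] == year].copy()
    out["Continent"] = out["Entity"].map(continents)
    # Drop rows that miss either of the two plotted quantities.
    return out.dropna(subset=["Life expectancy", "GDP per capita"])


def make_scatter(df_year: pd.DataFrame):
    """Interactive scatter plot with a log-scaled x-axis, colored by continent."""
    import plotly.express as px  # imported here so the data step works without Plotly

    return px.scatter(
        df_year,
        x="GDP per capita",
        y="Life expectancy",
        color="Continent",
        hover_name="Entity",
        log_x=True,
    )
```

Calling `make_scatter(prepare_year(df, 2018)).show()` would then display the plot for 2018; passing another year reuses the same two steps.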


B. Consider whether the results obtained seem reasonable and discuss what might be the explanation for the results you obtained. [1p]

The results show a clear trend: countries with a higher GDP per capita also tend to have a higher life expectancy. However, this does not always hold. For example, Saudi Arabia has one of the higher GDP per capita values at approximately \$50,000 and a life expectancy of 75 years. At the same y-coordinate we find Honduras, one of Latin America's poorest countries, with a GDP per capita of \$5,000. So although there is an overall positive trend between GDP per capita and life expectancy, individual countries can deviate substantially from the regression line. Generally speaking, countries with a higher GDP per capita may be able to provide better health care and better living standards for their population, resulting in a higher life expectancy. One must however keep in mind that factors other than GDP per capita also affect life expectancy.

C. Did you do any data cleaning (e.g., by removing entries that you think are not useful) for the task of drawing scatter plot(s) and the task of answering the questions d, e, f, and g? If so, explain what kind of entries that you chose to remove and why. If not, explain why you did not need to. [0.5p]

As mentioned in question 1.a, we removed the population size data in order to focus only on the relationship between GDP per capita and life expectancy. Data points with null values, for example a missing GDP per capita or life expectancy, were also removed, since incomplete rows cannot be plotted and could otherwise distort the analysis. Lastly, we decided to collect and visualize only the data from 2018. The reasoning is the assumption that the quality and availability of recent data, within this context, is better than that of much earlier years. That being said, we do not believe there to be big differences between nearby years.

D. Which countries have a life expectancy higher than one standard deviation above the mean? [0.5p]

With a standard deviation of approximately 7.747 and a mean of approximately 72.66, a life expectancy of at least one standard deviation above the mean must be at least 80.41 (72.66 + 7.747 ≈ 80.41). The countries that satisfy this are presented in Figure 2. The result applies to the specific year supplied by the user; in this case, the input year was 2018.
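The threshold used here is simply the mean plus one standard deviation. A minimal sketch of that calculation is given below, using only the standard library. The function name `above_one_std` and the toy values are hypothetical, and the sketch assumes the sample standard deviation (the default in pandas), which may or may not be what the notebook used.

```python
from statistics import mean, stdev


def above_one_std(life_expectancy: dict) -> tuple:
    """Return the threshold (mean + 1 std) and the countries strictly above it."""
    values = list(life_expectancy.values())
    threshold = mean(values) + stdev(values)  # stdev is the sample standard deviation
    selected = sorted(c for c, v in life_expectancy.items() if v > threshold)
    return threshold, selected


# Invented values, for illustration only (not taken from the dataset).
toy = {"A": 60.0, "B": 70.0, "C": 72.0, "D": 74.0, "E": 84.0}
threshold, countries = above_one_std(toy)
print(round(threshold, 2), countries)
```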

E. Which countries have high life expectancy but have low GDP (per capita)? [0.5p]

It essentially depends on how you define high life expectancy and low GDP. If we were to define low GDP as one standard deviation below the mean, it would yield a negative GDP score. This is because the standard deviation is greater than the mean, indicating a high variance between data points. Another way of defining low GDP would be to extract those data points whose GDP is below the median.

The GDP per capita median is 12165.79 and the life expectancy median is 74.368. The countries with a GDP per capita below the median and a life expectancy above the median are listed in Figure 3; they are located towards the upper left corner of the graph.
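The two definitions discussed above can be compared in a short sketch. It assumes that "high life expectancy" means above the life expectancy median, and it uses a hypothetical helper `high_life_low_gdp` and invented toy values; it is meant as an illustration rather than the notebook's code.

```python
import pandas as pd


def high_life_low_gdp(df: pd.DataFrame) -> pd.DataFrame:
    """Rows with GDP per capita below its median and life expectancy above its median."""
    gdp_median = df["GDP per capita"].median()
    life_median = df["Life expectancy"].median()
    mask = (df["GDP per capita"] < gdp_median) & (df["Life expectancy"] > life_median)
    return df[mask]


# Invented, right-skewed values for illustration only.
toy = pd.DataFrame(
    {
        "Entity": ["A", "B", "C", "D", "E"],
        "GDP per capita": [1_000.0, 2_000.0, 3_000.0, 4_000.0, 90_000.0],
        "Life expectancy": [62.0, 76.0, 70.0, 74.0, 80.0],
    }
)
gdp = toy["GDP per capita"]
print(gdp.mean() - gdp.std())  # negative here, because the std exceeds the mean
print(high_life_low_gdp(toy)["Entity"].tolist())
```

With strongly skewed data like this, the "mean minus one standard deviation" cutoff falls below zero and selects nothing, while the median-based cutoff still splits the countries into two halves.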

F. Does every strong economy (normally indicated by GDP) have high life expectancy? [1p]

For the year 2018, we can see a steep trendline, indicating a strong relationship; however, the variance between the data points is quite high, as can be seen from how spread out the data is. It is not always the case that countries with a higher GDP also score better on life expectancy. For example, India, with a GDP of around 9.2T, has a life expectancy of 69.41, whilst Sao Tome and Principe has a GDP of 787.192M for approximately the same life expectancy. In conclusion, GDP per capita is a better indicator than total GDP when studying the relationship between a country's economy and life expectancy.

The correlation between life expectancy and GDP per capita is stronger than that between life expectancy and total GDP. This is clear when comparing the two plots for the same year: the variance is visibly greater in the graph in Figure 4 than in Figure 1. There are however some similarities between the graphs. For example, the majority of countries in Africa tend to appear below the regression line and further to the left of the graph. The regression line in 1.a is also steeper, indicating a stronger relationship. Using GDP per capita can thus provide a clearer indication of a country's prosperity.

Task 2 - Download some other data sets, e.g. related to happiness and life satisfaction, trust, corruption, etc.

A. Think of several meaningful questions that can be answered with these data, make several informative visualisations to answer those questions. State any assumptions and motivate decisions that you make when selecting data to be plotted, and in combining data. [2.5p]

B. Discuss any observations that you make, or insights obtained, from the data visualisations. [2p]

Five datasets were downloaded for this task. We asked and then answered one question per dataset, and wrote a concluding comment at the end.

In the following figures, we plotted data for the most recent year available to make analyses that are more likely to represent the current situation.

We used the same functions as in task 1 since the data downloaded has the same format as 'Life expectancy vs. GDP per capita'.

The visualization decisions are the same as in question 1.a.

1. Life satisfaction vs. GDP per capita

Which countries have low life satisfaction but have high GDP per capita?
To determine the countries with low life satisfaction and high GDP per capita, the data was filtered to keep countries with a life satisfaction below the median and a GDP per capita above the median. The countries meeting these criteria are located in the bottom right corner of the plot (Figure 5). The GDP per capita median is \$18,278, while the life satisfaction median is 5.81. The results are displayed in Figure 6.


2. Children per woman vs. GDP per capita

Is the number of children per woman positively correlated with GDP per capita?
The trendline has a negative slope, which implies that the number of children per woman is negatively correlated with GDP per capita. The highest numbers of children per woman are observed in the upper left corner of Figure 8, where the GDP per capita is lower than the median (\$11,815). The predominant color of the dots is red in Figure 7, indicating that the countries with the most children per woman are located in Africa.


3. Share of adults who smoke vs. GDP per capita

Is there a trend between the share of adults who smoke and GDP per capita?
The data is scattered and the variance is large compared to the datasets analysed previously. The share of adults who smoke appears to be lowest in African countries, where the GDP per capita is the lowest (Figure 9). Countries with a high GDP per capita have a slightly higher share of smokers than countries in Africa. The countries with the highest share of adults who smoke have a GDP per capita lower than the median (\$14,253). Apart from these observations, no specific conclusions can be drawn because $R^{2}$ is close to zero, which indicates that a linear fit describes the data poorly. This also implies that the correlation between the share of adults who smoke and GDP per capita is low.
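For a straight-line fit with an intercept, $R^{2}$ equals the squared Pearson correlation, so a value close to zero means the fitted line explains almost none of the variation. The sketch below computes it by hand with the standard library. The function name `r_squared` and the toy values are hypothetical, and taking the base-10 logarithm of GDP per capita before fitting is an assumption made to match a log-scaled x-axis.

```python
import math


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # For a line with an intercept, R^2 is the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# Invented values for illustration only (not taken from the dataset).
gdp_per_capita = [1_000, 10_000, 100_000, 1_000, 100_000]
share_smokers = [10.0, 30.0, 10.0, 20.0, 20.0]
log_gdp = [math.log10(g) for g in gdp_per_capita]  # assumed log-scaled x values
print(r_squared(log_gdp, share_smokers))
```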


4. Medical doctors per 1,000 people vs. GDP per capita

An interesting question to ask here is whether countries with a higher GDP per capita have fewer or more medical doctors per 1,000 people. The graph for 2018 (Figure 11) shows several interesting facts. Up to a GDP per capita of about 6.5k, the variance seems to be quite low and there seems to be a relationship between medical doctors and GDP per capita, although this trend is not very clear. From 10k upwards, the data points are much more spread out. There are also some extremes, such as Georgia and Lithuania. Overall, there seems to be a positive trend. Countries with a lower GDP per capita that are considered developing countries are once again located in the lower left corner of the graph. This makes intuitive sense, as education is not as accessible in these countries.


5. Child mortality vs. GDP per capita

Child mortality is defined as the number of children born alive that die before their 5th birthday.

How has child mortality evolved from 1986 to 2016 (30 years)?

First of all, the trendline starts at around 21% in 1986 (Figure 13) whilst it starts at around 9% in 2016 (Figure 14), which indicates that overall child mortality has decreased. Most African countries are found on the upper left side of the graph. As we have seen earlier, Africa has many developing countries with a lower GDP per capita. A much lower GDP per capita may be a cause of higher child mortality rates, as the trendline in both graphs is quite evident. There seems to be a higher variance between the countries in 2016 than in 1986, at least for most of Africa and some of Asia. One can also argue that countries with a lower GDP and a greater population size tend to have a higher child mortality rate.


Overall, one can conclude that countries with a low GDP per capita have more health-related issues. By relating the observations drawn from the datasets, one could argue that life expectancy and the child mortality rate could be partly explained by the number of doctors per 1,000 people. Countries with a low GDP per capita have fewer doctors, a higher child mortality rate and a lower life expectancy.

References

Task 1

  1. Data compiled by Our World in Data. 2022. Life expectancy vs. GDP per capita. [online] Available at: https://ourworldindata.org/grapher/life-expectancy-years-vs-real-gdp-per-capita-2011us [Accessed 20 January 2022].
    Based on estimates by:
    1. Life expectancy: James C. Riley (2005), Estimates of Regional and Global Life Expectancy, 1800–2001. Population and Development Review, Volume 31, Issue 3, pages 537–543, September 2005; Zijdeman, Richard; Ribeira da Silva, Filipa, 2015, "Life Expectancy at Birth (Total)", http://hdl.handle.net/10622/LKYT53, IISH Dataverse, V1; and UN Population Division (2019)
    2. GDP per capita: Bolt, Jutta and Jan Luiten van Zanden (2020), “Maddison style estimates of the evolution of the world economy. A new 2020 update”.
  2. Plotly. 2022. Plotly Python Graphing Library. [online] Available at: https://plotly.com/python/ [Accessed 23 January 2022].
  3. Plotly. 2022. Linear Fits. [online] Available at: https://plotly.com/python/linear-fits/ [Accessed 23 January 2022].

Task 2

  1. Data compiled by Our World in Data (2022): Self-reported Life Satisfaction vs GDP per capita. [online] Available at: https://ourworldindata.org/grapher/gdp-vs-happiness [Accessed 20 January 2022].
    Based on estimates by:
    1. Life satisfaction: World Happiness Report (2021) [online] Available at: https://worldhappiness.report/ed/2021/#appendices-and-data [Accessed 20 January 2022].
    2. GDP per capita: International Comparison Program - World Bank, World Development Indicators - World Bank, Eurostat-OECD PPP Programme (2021) [online] Available at: http://data.worldbank.org/data-catalog/world-development-indicators [Accessed 20 January 2022].
  2. Data compiled by Our World in Data. 2022. Children per woman by GDP per capita. [online] Available at: https://ourworldindata.org/grapher/children-per-woman-by-gdp-per-capita [Accessed 20 January 2022].
    Based on estimates by:
    1. Children per woman: United Nations, Department of Economic and Social Affairs, Population Division (2019). World Population Prospects: The 2019 Revision, DVD Edition. [online] Available at: https://population.un.org/wpp2019/Download/Standard/Interpolated/ [Accessed 20 January 2022].
    2. GDP per capita: Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), "The Next Generation of the Penn World Table" American Economic Review, 105(10), 3150-3182, available for download at www.ggdc.net/pwt. PWT v9.1
  3. Data compiled by Our World in Data. 2022. Share of adults who smoke vs GDP per capita. [online] Available at: https://ourworldindata.org/grapher/share-of-adults-who-are-smoking-by-level-of-prosperity [Accessed 23 January 2022].
    Based on estimates by:
    1. Share of adults who smoke: Global Health Observatory Data Repository - World Health Organization (2021) [online] Available at: http://data.worldbank.org/data-catalog/world-development-indicators [Accessed 23 January 2022].
    2. GDP per capita: International Comparison Program - World Bank, World Development Indicators - World Bank, Eurostat-OECD PPP Programme (2021). [online] Available at: http://data.worldbank.org/data-catalog/world-development-indicators [Accessed 23 January 2022].
  4. Data compiled by Our World in Data. 2022. Medical doctors per 1,000 people vs. GDP per capita. [online] Available at: https://ourworldindata.org/grapher/medical-doctors-per-1000-people-vs-gdp-per-capita [Accessed 23 January 2022].
    Based on estimates by:
    1. Medical doctors per 1,000 people: Global Health Workforce Statistics - World Health Organization, OECD, official national sources (2021) [online] Available at: http://data.worldbank.org/data-catalog/world-development-indicators [Accessed 23 January 2022].
    2. GDP per capita: International Comparison Program - World Bank, World Development Indicators - World Bank, Eurostat-OECD PPP Programme [online] Available at: http://data.worldbank.org/data-catalog/world-development-indicators [Accessed 23 January 2022].
  5. Data compiled by Our World in Data. 2022. Child mortality vs GDP per capita. [online] Available at: https://ourworldindata.org/grapher/child-mortality-gdp-per-capita [Accessed 20 January 2022].
    Based on estimates by:
    1. Child mortality: Gapminder [online] Available at: https://www.gapminder.org/data/documentation/gd005/ [Accessed 20 January 2022].
    2. GDP per capita: The Maddison Project Database [online] Available at: https://www.rug.nl/ggdc/historicaldevelopment/maddison/releases/maddison-project-database-2020 [Accessed 20 January 2022].